Expanding the scope of 2D thread distribution to improve multi-threaded DGEMM performance #4655
This pull request proposes a fix for issue #4644.
The 2D thread distribution currently implemented in level3_thread.c works well for small matrices. However, it falls back to a simple one-dimensional distribution in the M direction as the matrices become larger. This pull request improves multi-threaded performance for large matrices by expanding the scope of the 2D thread distribution.
Performance improved by about 10% on Graviton3E (64 cores) and more than 20% on Xeon Platinum 8375C (32 cores x 2 sockets).
Graviton3E
Xeon Platinum 8375C
The calculations are distributed so that each thread handles a range of about the same size in the M and N directions, even when the input matrices are rectangular. Although not confirmed on all platforms, this relatively simple fix is expected to be generally effective on modern manycore CPUs.
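
To illustrate the idea of balancing the per-thread ranges in M and N, here is a minimal sketch. It is not the code in level3_thread.c: the `thread_grid` type and the `choose_thread_grid` helper are hypothetical names introduced only for this example, and the sketch assumes the thread count is split into an exact `threads_m x threads_n` grid. The actual selection logic in this pull request may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical type for illustration; not part of OpenBLAS. */
typedef struct {
    int threads_m; /* number of thread divisions along M */
    int threads_n; /* number of thread divisions along N */
} thread_grid;

/*
 * Hypothetical helper: among all factorizations
 * nthreads = threads_m * threads_n, pick the one whose per-thread
 * ranges m / threads_m and n / threads_n are closest in size.
 *
 * Because threads_m * threads_n is fixed, minimizing
 * |m / threads_m - n / threads_n| is equivalent to minimizing
 * |m * threads_n - n * threads_m|, which avoids floating point.
 */
static thread_grid choose_thread_grid(long m, long n, int nthreads)
{
    assert(m > 0 && n > 0 && nthreads > 0);

    /* Start from the one-dimensional split along M. */
    thread_grid best = { nthreads, 1 };
    long long best_cost = llabs((long long)m - (long long)n * nthreads);

    for (int tm = 1; tm <= nthreads; tm++) {
        if (nthreads % tm != 0)
            continue; /* only exact factorizations are considered */
        int tn = nthreads / tm;
        long long cost = llabs((long long)m * tn - (long long)n * tm);
        if (cost < best_cost) {
            best_cost = cost;
            best.threads_m = tm;
            best.threads_n = tn;
        }
    }
    return best;
}
```

Under these assumptions, a 16384 x 4096 problem on 64 threads yields a 16 x 4 grid, so each thread handles a 1024 x 1024 range. When the thread count is prime, the sketch leaves the split one-dimensional.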